Multicode: A Truly Multilingual Approach to Text Encoding
Abstract
priate for use in different countries have increased demand for a standard character set that works with many different languages. Currently, the ASCII character set [1] is the world’s most widely accepted standard character set for computers, operating systems, compilers, and e-mail systems. However, while ASCII adequately represents English text, it does not address text in other languages. ASCII is a 7-bit code and defines only 128 characters. In an 8-bit character format, the 128 code values that ASCII leaves unused can serve as extensions to ASCII that define characters of other languages. For example, the ISO 8859 standard [2] defines a Latin extension that supports many European languages, as well as Cyrillic, Arabic, Greek, Hebrew, and other language extensions. To handle documents that mix English with a second language, you must use the ASCII extension for the second language. However, this approach presents two key problems:
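The extension mechanism described in the abstract can be made concrete with a small sketch. It is not taken from the article: the names `RAW`, `PARTS`, `is_seven_bit`, and `decode_all` are hypothetical, and the codec names are Python's aliases for ISO 8859 parts 1 (Latin), 5 (Cyrillic), and 7 (Greek). It shows that a byte above the 7-bit range has no meaning in ASCII and a different meaning under each extension.

```python
# A minimal sketch: one byte string, three ISO 8859 parts, three readings.
RAW = bytes([0x41, 0xE4])  # 0x41 is ASCII 'A'; 0xE4 is above the 7-bit range

# Python codec aliases for ISO 8859 parts 1, 5, and 7 (illustrative choice).
PARTS = ("iso-8859-1", "iso-8859-5", "iso-8859-7")


def is_seven_bit(data: bytes) -> bool:
    """Return True when every byte fits in ASCII's 128-value range."""
    return all(b < 0x80 for b in data)


def decode_all(data: bytes) -> dict[str, str]:
    """Decode the same bytes under each ISO 8859 part listed in PARTS."""
    return {part: data.decode(part) for part in PARTS}


if __name__ == "__main__":
    print(is_seven_bit(RAW))
    for part, text in decode_all(RAW).items():
        # Print code points so the output does not depend on the terminal.
        print(part, [f"U+{ord(ch):04X}" for ch in text])
```

Under these assumptions, the byte 0xE4 decodes to a Latin, a Cyrillic, or a Greek letter depending on which part is chosen, so the bytes alone do not identify the language of the text.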
Similar resources
Interactive multilingual text generation for a monolingual user
In this paper we describe an approach to machine translation which involves multilingual text generation via interaction with the user, who is monolingual: the system will work in a specific and fairly restricted domain. Key techniques used include the use of examples rather than linguistic rules to give the equivalents between the languages, and the encoding of contextual knowledge in the form...
Chapter 4 Character encoding in corpus construction
Corpus linguistics has developed, over the past three decades, into a rich paradigm that addresses a great variety of linguistic issues ranging from monolingual research of one language to contrastive and translation studies involving many different languages. Today, while the construction and exploitation of English language corpora still dominate the field of corpus linguistics, corpora of ot...
Against multilinguality
1. Introduction An obvious assumption of the present workshop is that multilingual corpora are useful, and should be built and investigated. In the present paper, I would like to point out that this is far from straightforward and actually remains to be proved. In addition, and in a more constructive vein, I want to present some examples that show that the right encoding depends crucially on wh...
Towards a Language Independent Encoding of Documents: A Novel Approach to Multilingual Question Answering
Given source text in several languages, can one answer queries in some other language, without translating any of the sources into the language of the questioner? In this paper we try to address this question as we report our work on a restricted domain, multilingual Question – Answering system, with current implementations for source text in English and questions posed in English and Hindi. Th...
A truly multilingual, high coverage, accurate, yet simple, subsentential alignment method
This paper describes a new alignment method that extracts high quality multi-word alignments from sentence-aligned multilingual parallel corpora. The method can handle several languages at once. The phrase tables obtained by the method have a comparable accuracy and a higher coverage than those obtained by current methods. They are also obtained much faster.
Journal:
- IEEE Computer
Volume 30
Publication date: 1997